Fusion Models for Improved Image Captioning
Authors
Abstract
Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models while also rendering them liable to making mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders and coherent text generators. Meanwhile, several unimodal and multimodal fusion techniques have been proven to work well for natural language generation and automatic speech recognition. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of this paper is two-fold: First, we propose a generic multimodal model fusion framework for caption generation and emendation, where we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within a traditional encoder-decoder visual captioning framework. Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our experiments on three benchmark datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline, indicating the usefulness of the proposed fusion strategies. Further, we perform a preliminary qualitative analysis of the emended captions and identify error categories based on the type of corrections.
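For readers unfamiliar with how a pretrained language model can be fused with a captioning decoder, the sketch below illustrates one of the simplest such strategies, shallow fusion, which originated in automatic speech recognition: at each decoding step, the log-probabilities of the visual decoder and the auxiliary language model are combined with a scalar weight. This is only a minimal illustration under assumed interfaces; the function name shallow_fusion_step, the weight lm_weight, and the assumption that both models share one vocabulary are hypothetical, and the paper itself evaluates several fusion variants rather than exactly this formulation.

```python
# Minimal sketch of shallow fusion at decoding time (illustrative, not
# the paper's exact method). Assumes the captioning decoder and the
# auxiliary language model (AuxLM) emit logits over a shared vocabulary.
import torch
import torch.nn.functional as F

def shallow_fusion_step(caption_logits: torch.Tensor,
                        auxlm_logits: torch.Tensor,
                        lm_weight: float = 0.2) -> torch.Tensor:
    """Fuse per-step distributions of the visual decoder and the AuxLM.

    caption_logits: (batch, vocab) logits from the captioning decoder
    auxlm_logits:   (batch, vocab) logits from the auxiliary LM
    lm_weight:      interpolation weight for the AuxLM (hypothetical value)
    Returns fused log-probabilities of shape (batch, vocab).
    """
    # Combine in the log domain so the AuxLM acts as a soft prior
    # over the decoder's next-token distribution.
    fused = (F.log_softmax(caption_logits, dim=-1)
             + lm_weight * F.log_softmax(auxlm_logits, dim=-1))
    return fused

# Toy usage with random logits standing in for real model outputs:
vocab_size = 10000
caption_logits = torch.randn(1, vocab_size)  # from the visual decoder at step t
auxlm_logits = torch.randn(1, vocab_size)    # from the AuxLM at step t
next_token = shallow_fusion_step(caption_logits, auxlm_logits).argmax(dim=-1)
```

Combining in the log domain keeps the operation numerically stable and makes lm_weight directly interpretable as the strength of the language-model prior; deeper fusion variants instead merge hidden states before the output projection.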
Similar resources
Can Saliency Information Benefit Image Captioning Models?
To bridge the gap between humans and machines in image understanding and describing, we need further insight into how people describe a perceived scene. In this paper, we study the agreement between bottom-up saliency-based visual attention and object referrals in scene description constructs. We investigate the properties of human-written descriptions and machine-generated ones. We then propos...
Language Models for Image Captioning: The Quirks and What Works
Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a...
Contrastive Learning for Image Captioning
Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learn...
Stack-Captioning: Coarse-to-Fine Learning for Image Captioning
Existing image captioning approaches typically train a one-stage sentence decoder, which struggles to generate rich, fine-grained descriptions. On the other hand, a multi-stage image captioning model is hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which...
Phrase-based Image Captioning
Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representat...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2021
ISSN: 0302-9743, 1611-3349
DOI: https://doi.org/10.1007/978-3-030-68780-9_32